Finale





Lecture 15: Matrix Factorization 
linear models of movies on extracted user 
features (or vice versa) jointly optimized with 
stochastic gradient descent 













Lecture 16: Finale
* Feature Exploitation Techniques
* Error Optimization Techniques
* Overfitting Elimination Techniques
* Machine Learning in Practice
Finale Feature Exploitation Techniques


Exploiting Numerous Features via Kernel

numerous features within some Φ:
embedded in kernel K_Φ with inner product operation

Polynomial Kernel: ‘scaled’ polynomial transforms
Gaussian Kernel: infinite-dimensional transforms
Stump Kernel: decision-stumps as transforms
Sum of Kernels: transform union
Product of Kernels: transform combination
Mercer Kernels: transform implicitly





possibly K 
kernel logistic regression; probabilistic SVM
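A minimal Python sketch of why a kernel can stand in for a transform. It checks that a degree-2 polynomial kernel equals the inner product of an explicit ‘scaled’ polynomial transform, and that a sum of kernels matches the union (concatenation) of the transforms. The function names and the specific kernel (1 + x·z)^2 on 2-D inputs are illustrative assumptions, not taken from the slides.

```python
import math


def inner(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    """K1(x, z) = x.z, whose transform is the identity."""
    return inner(x, z)


def poly2_kernel(x, z):
    """K2(x, z) = (1 + x.z)^2, computed without building any transform."""
    return (1.0 + inner(x, z)) ** 2


def poly2_transform(x):
    """Explicit 'scaled' 2nd-order polynomial transform for a 2-D input."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 * x1, s * x1 * x2, x2 * x2]


def union_transform(x):
    """Concatenate the identity transform and the polynomial transform."""
    return list(x) + poly2_transform(x)


def sum_kernel(x, z):
    """Sum of kernels; corresponds to the union of the two transforms."""
    return linear_kernel(x, z) + poly2_kernel(x, z)
```

For any two 2-D points, `poly2_kernel(x, z)` should agree with `inner(poly2_transform(x), poly2_transform(z))` up to rounding, and `sum_kernel(x, z)` with the inner product of the two `union_transform` outputs.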


Exploiting Predictive Features via Aggregation

predictive features within some Φ:
φ_t(x) = g_t(x)

Decision Stump: simplest perceptron; simplest DecTree
Decision Tree: branching (divide) + leaves (conquer)
(Gaussian) RBF: prototype (center) + influence


Bagging; 
Random Forest 








possibly À 


Exploiting Hidden Features via Extraction

hidden features within some Φ:
as hidden variables to be ‘jointly’ optimized with usual weights,
possibly with the help of unsupervised learning

Neural Network; Deep Learning: neuron weights
RBF Network: RBF centers
Matrix Factorization: user/movie factors
AdaBoost; GradientBoost: g_t parameters
k-Means: cluster centers
Autoencoder; PCA: ‘basis’ directions















possibly GradientBoosted Neurons, 
NNet on Factorized Features, ... 


Exploiting Low-Dim. Features via Compression

low-dimensional features within some Φ:
compressed from original features

Decision Stump; DecTree Branching: ‘best’ naive projection to ℝ
Random Forest Tree Branching: ‘random’ low-dim. projection
Autoencoder; PCA: info.-preserving compression
Matrix Factorization: projection from abstract to concrete
Feature Selection: ‘most-helpful’ low-dimensional projection

possibly other ‘dimension reduction’ models







Fun Time

Consider running AdaBoost-Stump on a PCA-preprocessed data set.
Then, in terms of the original features x, what does the final hypothesis
G(x) look like?

(1) a neural network with tanh(·) in the hidden neurons
(2) a neural network with sign(·) in the hidden neurons
(3) a decision tree
(4) a random forest

Reference Answer: (2)

PCA results in a linear transformation of x.
Then, when applying a decision stump on the
transformed data, it is as if a perceptron is
applied on the original data. So the resulting G
is simply a linear aggregation of perceptrons.
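The reasoning in the reference answer can be sketched in Python. A decision stump applied to one coordinate of a linearly transformed input is rewritten as a perceptron on the original input. The matrix `W` and vector `mean` passed in are hypothetical stand-ins for the output of PCA (no PCA is computed here), and all function names are assumptions made for illustration.

```python
def sign(v):
    """Sign with the convention sign(0) = -1."""
    return 1 if v > 0 else -1


def project(W, mean, x):
    """Linear (PCA-style) transform: z = W (x - mean)."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(w * c for w, c in zip(row, centered)) for row in W]


def stump(z, s, i, theta):
    """Decision stump on transformed feature i: s * sign(z_i - theta)."""
    return s * sign(z[i] - theta)


def perceptron(x, w, b):
    """Perceptron on the original features: sign(w.x + b)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)


def stump_as_perceptron(W, mean, s, i, theta):
    """Weights and bias of the perceptron equivalent to the stump.

    s * (z_i - theta) = (s * W_i).x - s * (W_i.mean + theta),
    so the two agree everywhere except exactly on the boundary.
    """
    w = [s * wij for wij in W[i]]
    b = -s * (sum(wij * mj for wij, mj in zip(W[i], mean)) + theta)
    return w, b
```

Since each stump becomes one sign(·) neuron, a weighted vote over such stumps has the shape of a network with one hidden layer of sign(·) neurons.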
Finale Error Optimization Techniques


Numerical Optimization via Gradient Descent 
when ∇E ‘approximately’ defined, use it for 1st order approximation:

new variables = old variables - η∇E

SGD/Minibatch/GD: (Kernel) LogReg; Neural Network [backprop]; Matrix Factorization; Linear SVM (maybe)
Steepest Descent: AdaBoost; GradientBoost
Functional GD: AdaBoost; GradientBoost

possibly 2nd order techniques, GD under constraints, ...
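The update rule "new variables = old variables - η∇E" can be sketched in a few lines of Python. The quadratic error function, the step size `eta`, and the step count are illustrative assumptions chosen only so that the iteration visibly converges.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeat: new variables = old variables - eta * gradient."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


def grad_quadratic(w):
    """Gradient of the toy error E(w) = (w0 - 3)^2 + 2 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]
```

Calling `gradient_descent(grad_quadratic, [0.0, 0.0])` should move the variables toward the minimizer (3, -1) of the toy error. Replacing the full gradient by a gradient estimated on one example or a small batch gives the SGD and minibatch variants.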







Indirect Optimization via Equivalent Solution 


when difficult to solve original problem,
seek an equivalent solution

Dual SVM: equivalence via convex QP
Kernel LogReg; Kernel RidgeReg: equivalence via representer
PCA: equivalence to eigenproblem


some other boosting models and modern 
solvers of kernel models rely on such a 
technique heavily 
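As one hedged illustration of "equivalence to eigenproblem", the sketch below finds the first PCA direction of 2-D data by solving the 2x2 eigenproblem of the covariance matrix in closed form, instead of searching over directions. Restricting to 2-D data and the helper names are assumptions made to keep the example free of external libraries.

```python
import math


def covariance_2d(points):
    """Entries (a, b, c) of the covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    return a, b, c


def top_eigenpair_2x2(a, b, c):
    """Largest eigenvalue and a unit eigenvector of [[a, b], [b, c]]."""
    lam = (a + c) / 2.0 + math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    if abs(b) > 1e-12:
        v = (b, lam - a)
    else:
        # Diagonal matrix: the top direction is a coordinate axis.
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)


def projected_variance(a, b, c, v):
    """Variance of the data along unit direction v, i.e. v^T C v."""
    return a * v[0] ** 2 + 2.0 * b * v[0] * v[1] + c * v[1] ** 2
```

The variance along the returned eigenvector equals the top eigenvalue, and no other unit direction should exceed it, which is the sense in which the variance-maximization problem and the eigenproblem share a solution.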


Complicated Optimization via Multiple Steps 
when difficult to solve original problem,
seek ‘easier’ sub-problems

Multi-Stage: probabilistic SVM; linear blending; stacking; RBF Network; DeepNet pre-training
Alternating Optim.: k-Means; alternating LeastSqr; (steepest descent)
Divide & Conquer: decision tree

useful for complicated models
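Alternating optimization can be sketched with k-Means on 1-D data: each round solves two ‘easier’ sub-problems in turn, first the best assignment for fixed centers, then the best centers for a fixed assignment. The function name, the 1-D restriction, and the stopping rule are illustrative assumptions.

```python
def kmeans_1d(xs, centers, rounds=20):
    """Alternate between assigning points and re-centering clusters."""
    centers = list(centers)
    for _ in range(rounds):
        # Sub-problem 1: with centers fixed, assign each point to its
        # nearest center.
        groups = [[] for _ in centers]
        for x in xs:
            k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            groups[k].append(x)
        # Sub-problem 2: with the assignment fixed, the best center of
        # each group is its mean (an empty group keeps its old center).
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Neither sub-problem increases the total squared distance to the assigned centers, so the procedure settles at a local optimum that depends on the initial centers.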







Fun Time

When running the DeepNet algorithm introduced in Lecture 213 on a
PCA-preprocessed data set, which optimization technique is used?

(1) variants of gradient-descent
(2) locating equivalent solutions
(3) multi-stage optimization
(4) all of the above

Reference Answer: (4)

minibatch GD for training; equivalent
eigenproblem solution for PCA; multi-stage for
pre-training


Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 




Finale Overfitting Elimination Techniques 


Overfitting Elimination via Regularization 


when model too ‘powerful’:
add brakes somewhere

large-margin: SVM; AdaBoost (indirectly)
L2: SVR; kernel models; NNet [weight-decay]
voting/averaging: uniform blending; Bagging; Random Forest
denoising: autoencoder
pruning: decision tree
weight-elimination: NNet
early stopping: NNet (any GD-like)
constraining: autoenc. [weights]; RBF [# centers]

arguably most important techniques
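One way to see a ‘brake’ at work is L2 regularization on a single weight. The sketch below fits y ≈ w·x by minimizing squared error plus λw², whose closed-form solution shrinks toward zero as λ grows. The one-weight setting and the name `ridge_1d` are assumptions chosen for brevity.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum((w * x - y)^2) + lam * w^2 over one weight w.

    Setting the derivative to zero gives
    w = sum(x * y) / (sum(x * x) + lam).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + lam
    return numerator / denominator
```

With λ = 0 this is ordinary least squares; a larger λ gives a smaller |w|, which is the weight-decay effect listed above for NNet.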


Overfitting Elimination via Validation 


when model too ‘powerful’:
check performance carefully and honestly

# SV: SVM/SVR
OOB: Random Forest
Internal Validation: blending; DecTree pruning

simple but necessary
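A minimal sketch of checking performance ‘honestly’: candidate models are compared only on held-out validation examples, and the one with the lowest validation error is selected. The 1-D decision stumps used as candidates, and all names, are hypothetical and serve only to make the example runnable.

```python
def make_stump(theta):
    """Hypothetical candidate model: 1-D decision stump sign(x - theta)."""
    return lambda x: 1 if x > theta else -1


def validation_error(predict, val_xs, val_ys):
    """Fraction of validation examples that the model gets wrong."""
    wrong = sum(1 for x, y in zip(val_xs, val_ys) if predict(x) != y)
    return wrong / len(val_xs)


def select_by_validation(models, val_xs, val_ys):
    """Return the name of the model with the lowest validation error."""
    return min(
        models,
        key=lambda name: validation_error(models[name], val_xs, val_ys),
    )
```

The validation examples must not have been used to train the candidates; otherwise the comparison is no longer an honest estimate.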







Fun Time

What is the major technique for eliminating overfitting in Random
Forest?

(1) voting/averaging
(2) pruning
(3) early stopping
(4) weight-elimination

Reference Answer: (1)

Random Forest, based on uniform blending,
relies on voting/averaging for regularization.
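The regularizing effect of averaging can be checked numerically at a single input point. For predictions g_t and their uniform average G, the average squared error of the g_t equals the spread of the g_t around G plus the squared error of G, so the blended prediction is never worse than the average individual one. The function name and the toy numbers used to call it are assumptions.

```python
def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def blending_decomposition(preds, target):
    """Return (avg individual error, spread around G, error of G).

    Identity: avg((g_t - f)^2) = avg((g_t - G)^2) + (G - f)^2,
    where G is the uniform average of the predictions g_t.
    """
    blended = average(preds)
    avg_individual = average([(g - target) ** 2 for g in preds])
    spread = average([(g - blended) ** 2 for g in preds])
    blended_error = (blended - target) ** 2
    return avg_individual, spread, blended_error
```

The more the individual predictions disagree, the larger the spread term, and the more the uniform blend gains over a typical individual predictor.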
Finale Machine Learning in Practice


NTU KDDCup 2010 World Champion Model 


Feature engineering and classifier ensemble for KDD Cup 2010,
Yu et al., KDDCup 2010

linear blending of
Logistic Regression (many rawly encoded features) + Random Forest (human-designed features)

yes, you've learned everything! :-)







NTU KDDCup 2011 Track 1 World Champion Model 
A linear ensemble of individual and blended models for music rating
prediction, Chen et al., KDDCup 2011

NNet, DecTree-like, and then linear blending of

* Matrix Factorization variants, including probabilistic PCA
* Restricted Boltzmann Machines: an ‘extended’ autoencoder
* k Nearest Neighbors
* Probabilistic Latent Semantic Analysis:
  an extraction model that has ‘soft clusters’ as hidden variables
* linear regression, NNet, & GBDT

yes, you can ‘easily’ understand everything! :-)




NTU KDDCup 2012 Track 2 World Champion Model 
A two-stage ensemble of diverse models for advertisement ranking in
KDD Cup 2012, Wu et al., KDDCup 2012

NNet, GBDT-like, and then linear blending of

* Linear Regression variants, including linear SVR
* Logistic Regression variants
* Matrix Factorization variants





‘key’ is to blend properly without overfitting
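A hedged sketch of what linear blending computes, reduced to two models: least-squares weights for combining their predictions. This is not the champion teams' code. The function name and the two-model restriction are assumptions, and the predictions passed in are assumed to come from held-out examples so that the blend weights are not fit on data the base models were trained on.

```python
def linear_blend_2(p1, p2, ys):
    """Weights (a1, a2) minimizing sum((a1 * p1 + a2 * p2 - y)^2).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    s11 = sum(u * u for u in p1)
    s12 = sum(u * v for u, v in zip(p1, p2))
    s22 = sum(v * v for v in p2)
    t1 = sum(u * y for u, y in zip(p1, ys))
    t2 = sum(v * y for v, y in zip(p2, ys))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("predictions are collinear; weights are not unique")
    a1 = (t1 * s22 - t2 * s12) / det
    a2 = (t2 * s11 - t1 * s12) / det
    return a1, a2
```

Here `p1` and `p2` are the two models' predictions on the same held-out examples and `ys` are the true targets; the blended prediction on a new example is `a1 * g1(x) + a2 * g2(x)`.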




NTU KDDCup 2013 Track 1 World Champion Model 
Combination of feature engineering and ranking models for paper-author
identification in KDD Cup 2013, Li et al., KDDCup 2013

linear blending of

* Random Forest with many many many trees
* GBDT variants

with tons of effort in designing features

‘another key’ is to construct features with
domain knowledge







ICDM 2006 Top 10 Data Mining Algorithms 


(1) C4.5: another decision tree
(2) k-Means
(3) SVM
(4) Apriori: for frequent itemset mining
(5) EM: ‘alternating optimization’ algorithm for some models
(6) PageRank: for link-analysis, similar to matrix factorization
(7) AdaBoost
(8) k Nearest Neighbor
(9) Naive Bayes: a simple linear model with ‘weights’
(10) C&RT





personal view of five missing ML competitors: 
LinReg, LogReg, 
Random Forest, GBDT, NNet 


Machine Learning Jungle 


bagging, decision tree, support vector machine, neural network, kernel,
AdaBoost, aggregation, sparsity, autoencoder, functional gradient,
dual, uniform blending, deep learning, nearest neighbor, decision stump,
kernel LogReg, large-margin, prototype, quadratic programming, SVR,
GBDT, PCA, random forest, matrix factorization, Gaussian kernel,
k-means, OOB error, RBF network, probabilistic SVM, soft-margin

welcome to the jungle!







Fun Time

Which of the following is the official lucky number of this class?

(1) 9876
(2) 1234
(3) 1126
(4) 6211

Reference Answer: (3)

May the luckiness always be with you!


Summary
(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models
(3) Distilling Implicit Features: Extraction Models














Lecture 16: Finale
* Feature Exploitation Techniques
  kernel, aggregation, extraction, low-dimensional
* Error Optimization Techniques
  gradient, equivalence, stages
* Overfitting Elimination Techniques
  (lots of) regularization, validation
* Machine Learning in Practice
  welcome to the jungle

* next: happy learning!


